I’m implementing a document analysis system that needs to locate specific text segments within larger documents. Given a reference text snippet, I need to find where this content appears in the original document(span), even when there might be slight differences in formatting, punctuation, or wording.
I’d like to know:
-
The formal NLP/IR terminology for this type of task. Is this considered “approximate string matching,” “span detection” or something else? Having the correct terminology will help me research existing literature and solutions. I’ve done some research on “span detection”/“span extraction”, but they might not suit my scenario that much? Because I found they’re more focused on biology or different NLP tasks like emotion extraction or Named Entity Recognition.
-
Recommended approaches for solving this specific problem: